{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RASA 项目中数据处理\n",
    "\n",
    "在一个 RASA 项目中，数据必须遵循 [RASA Training data format](https://rasa.com/docs/rasa/nlu/training-data-format/) 才能进行训练，本片文章是一个将 CSV 文件转化成目标格式的例子。\n",
    "\n",
    "原数据在 [raw/orders.csv](raw/orders.csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 设置 Jupyter Notebook\n",
    "\n",
    "设置 Jupyter Notebook 的输出，只打印有用的信息，忽略没用的 warning 信息。\n",
    "\n",
    "如果使用 Jupyter Notebook, 请查看[Using Jupyter Notebook](https://rasa.com/docs/rasa/api/jupyter-notebooks/)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "# Logging settings\n",
    "import logging, io, json, warnings, pprint\n",
    "# logging.basicConfig(level=\"ERROR\")\n",
    "# warnings.filterwarnings('ignore')\n",
    "pp = pprint.PrettyPrinter(indent=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备实验数据 (CSV to JSON)\n",
    "\n",
    "将一个 CSV 文件转化成一个可以供 RASA 使用的 JSON 训练集."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "\n",
    "# 读取文件\n",
    "raw_df = pd.read_csv('raw/orders.csv', delimiter=',', usecols=['refund_Info', 'category_id'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "预览 10 行数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据中有些 column 是空，这里涉及到对缺失数据的处理。另外 category 字段为数字，会对分类造成错误，变成 `category_x` 这样的文本类型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove the column which refund_Info column is empty\n",
    "raw_df = raw_df[raw_df.refund_Info.notnull()]\n",
    "\n",
    "# rename the column name\n",
    "raw_df = raw_df.rename({'refund_Info': 'text', 'category_id': 'intent'}, axis='columns')\n",
    "\n",
    "# add prefix category_ to category_id\n",
    "raw_df['intent'] = raw_df['intent'].apply(lambda x: 'category_' + str(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "转化成 RASA JSON 格式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "examples = raw_df.to_dict(orient='records')\n",
    "\n",
    "output = {\n",
    "    \"rasa_nlu_data\": {\n",
    "        \"common_examples\": examples\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出到文件 `data_exp2/result.json` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('data_exp2/result.json', 'w', encoding='utf-8') as fp:\n",
    "    json.dump(output, fp, ensure_ascii=False, indent=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (可选)转化成易读数据集 (JSON to Markdown)\n",
    "\n",
    "RASA 支持 JSON 和 Markdown 两种格式，都可以用来训练，Markdown 更方便阅读，如果有需要，可以将JSON文件转化成Markdown"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "from rasa.nlu.convert import convert_training_data\n",
    "\n",
    "convert_training_data(data_file=\"data_exp2/result.json\", out_file=\"data_exp2/result.md\", output_format=\"md\", language=\"zh\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 验证数据是否符合 RASA 项目的训练要求\n",
    "\n",
    "> [Valiate Files](https://rasa.com/docs/rasa/user-guide/validate-files/) 文档上给的代码是错误的。需要提交 Pull Request 来修复。\n",
    "\n",
    "> // TODO, 下面这个验证会失败，但不影响实验."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "from rasa import utils\n",
    "from rasa.core.validator import Validator\n",
    "from rasa.importers.importer import TrainingDataImporter\n",
    "\n",
    "dataImporter = TrainingDataImporter.load_nlu_importer_from_config(config_path=\"config.yml\",\n",
    "                                                                  training_data_paths=\"data_exp2/result.md\")\n",
    "\n",
    "validator = await Validator.from_importer(importer=dataImporter)\n",
    "\n",
    "validator.verify_all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 训练 NLU 模型\n",
    "\n",
    "根据如下代码训练 NLU 模型，[参考文档](https://github.com/RasaHQ/rasa-workshop-pydata-dc/blob/master/rasa-pydatadc-workshop-completed.ipynb),\n",
    "该文档为旧版本的 rasa 语法。需要根据新版本的语法，并且改成中文训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = \"config.yml\"\n",
    "training_files = \"data_exp2/result.json\"\n",
    "domain = \"domain.yml\"\n",
    "output = \"models/\"\n",
    "\n",
    "import rasa\n",
    "model_path = rasa.train(config, [training_files], output)\n",
    "print(model_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 查看 NLU 模型的输出结果\n",
    "\n",
    "手动查看一些 NLU 训练模型的输出结果\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实验结束。Have a nice day!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
